Abstract: We consider the problem of distributing a file in a 
network of storage nodes whose storage budget is limited but at 
least equals the file size. We first generate T encoded symbols 
(from the file) which are then distributed among the nodes. We 
investigate the optimal allocation of T encoded packets to the 
storage nodes such that the probability of reconstructing the file 
by using any r out of n nodes is maximized. Since the optimal 
allocation of encoded packets is difficult to find in general, 
we find another objective function which well approximates 
the original problem and yet is easier to optimize. We find 
the optimal symmetric allocation for all coding redundancy 
constraints using the equivalent approximate problem. We also 
investigate the optimal allocation in random graphs. Finally, we 
provide simulations to verify the theoretical results. 

I. Introduction 

A file in a distributed storage network can be replicated 
throughout the network to improve the performance of retrieval 
process, measured by routing efficiency, persistence of the 
file in the network when some storage locations go out of 
service, and many other criteria. Most of the studies in network 
file storage consider a common practice where every node in 
the network either stores the entire file or none of it. In an 
important article, Naor and Roth [1] studied how to store a 
file in a network such that every node can recover the file by 
accessing only the portions of the file stored on itself and its 
neighbors, with the objective of minimizing the total amount of 
data stored. By applying MDS (Maximum Distance Separable) 
codes and generating codeword symbols of the file, they presented 
a solution that is asymptotically optimal in minimizing 
the total number of stored bits when the original file has a 
length much larger than the logarithm of the degree of the 
storage network graph. Other works [2], [3] extended the 
result of [1] and devised algorithms for memory allocation in 
tree networks with heterogeneous clients. Distributed storage 
is also studied in sensor networks [4], [5]. In sensor networks, 
the focus is usually on data retrieval, assuming that a data 
collector has access to a random subset of storage nodes, while 
in this paper we address the allocation problem. 

One of the appealing features for a distributed storage 
system is the ability to scale the persistence of data arbitrarily 
up and down on-demand. In other words, the cost of accessing 
the stored data should be adjustable based on the demand. In 
one extreme, all the nodes have "easy" access to the stored 
file, either by storing the whole file or a large part of it. On the 
other extreme, just a single node stores the file entirely and 
other nodes need to fetch the file from that node. It is clear that 
by making more copies of a file and spreading those copies in 
the network, the retrieval of the file becomes easier. The use of 
MDS codes provides the flexibility to increase the persistence 
of a file gradually. For example, for a given file of size F, we 
can generate T symbols using a (T, F) MDS code such that 
every F-subset of those T symbols is sufficient to reconstruct 
the original file. We call T the budget considered for the file. 
Now, the question is how increasing the budget of a file 
affects the retrieval process. To answer this question, 
we need to consider a model for data retrieval. Recently, 
Leong et al. [6] investigated this problem and introduced the 
following model for the network. Consider a network with n 
storage nodes. We distribute a file of size F and budget T 
(packets or symbols) among these storage nodes. Then, we 
look at all the possible subsets of size r of the storage nodes. 
We say that a specific r-subset is successful in recovering the 
file if the total number of packets stored in that subset of 
the nodes is at least the file size F. We are to find the best 
assignment of these T symbols to n storage nodes such that 
the maximum number of the r-subsets of storage nodes have 
enough symbols to reconstruct the file. The rationale 
behind the model is that in a real storage network, every node 
can be reached by all the other nodes in the network. Once a 
retrieval request for a file is received by a node in the network, 
the node tries to fetch all the parts of the file and respond to 
the request. The cost of fetching the parts from different nodes 
is not equal (other nodes may be down, busy, etc.). Therefore, 
in the model we assume that each node fetches the necessary 
parts of the file from the other r − 1 most accessible nodes. 

In general, this problem is quite challenging and the optimal 
allocation is non-trivial. In [6], the authors provide some 
results for the symmetric allocation and the probability-1 recovery 
regime, which is a special case of the problem introduced 
in [1]. Symmetric allocation refers to a scheme where, based 
on the budget, we split the storage nodes into two groups: 
the nodes with no stored symbols and the nodes that store the 
same number of symbols. In the probability-1 recovery regime, all 
the nodes should be able to reconstruct the file. As illustrated 
in [6], the optimal allocation is not obvious even if we only 
consider the symmetric allocations. 

For very low budgets, we observe that the budget is concentrated 
over a minimal subset of storage nodes in the optimal 
allocation. On the other hand, for high budget levels, we 
observe a maximal spread of budget over storage nodes. It 
is of interest to determine how this transition occurs and 
also to study the behavior of the optimal allocation versus 
budget. In this paper, we take the initial steps towards the 
characterization of the optimal allocation. In Section II, we 
give the formal definition of the problem and the model we 
consider. Then, in Section III, we prove that an easier-to-solve 
problem well approximates the original problem. Using 
the alternative approach, we solve the file allocation problem 
for symmetric allocations (Section IV); we also consider 
symmetric allocations in random graphs. Finally, simulation 
results are provided in Section V. 

II. File Allocation Problem 
A. Problem Statement 

We are given a file of size F and a network with budget 
T. We generate T redundant symbols using a (T, F) MDS 
code. An allocation of T symbols to n nodes is defined to be 
a partition of T into n sets of sizes $x_1, \ldots, x_n$, where $x_i$ is 
the number of symbols allocated to the ith storage node. Note 
that $\sum_{i=1}^{n} x_i = T$ and $x_i \ge 0$ for $i = 1, \ldots, n$. Our goal is to 
find an allocation which maximizes the number of r-subsets 
jointly storing F or more packets. 
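
The objective above can be evaluated by direct enumeration when n is small. The following is a minimal sketch, not the authors' code; the function name and the toy instance (n = 5, F = 4, T = 8, r = 2) are assumptions chosen only for illustration.

```python
from itertools import combinations

def count_successful_subsets(allocation, r, file_size):
    """Count r-subsets of nodes that jointly store at least file_size symbols."""
    return sum(
        1
        for subset in combinations(allocation, r)  # positional, so equal values stay distinct
        if sum(subset) >= file_size
    )

# Hypothetical toy instance: n = 5 nodes, F = 4, T = 8, r = 2.
F, r = 4, 2
concentrated = [4, 4, 0, 0, 0]  # two nodes hold the whole file
spread = [2, 2, 2, 2, 0]        # four nodes hold half of the file each

print(count_successful_subsets(concentrated, r, F))
print(count_successful_subsets(spread, r, F))
```

Enumeration costs on the order of the number of r-subsets, so it is only practical for small networks; it is meant as a reference against which approximations can be compared.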

Combination networks provide a simple illustration of the 
allocation problem under study. As shown in Figure 1, there 
are three layers of nodes. A virtual source node in layer one 
has a file of size F (packets) to be distributed among storage 
nodes in layer two. There are n storage nodes in layer two, which 
represent the actual storage nodes in the storage network. 
Attributed to the file is a budget T. The data retrieval phase 
is visualized in the third layer of the combination network, 
which contains $\binom{n}{r}$ virtual receiver nodes. Each receiver node 
corresponds to an r-subset of the storage nodes. We are 
going to find the best allocation of the budget T such that 
the maximum number of receivers R in the combination 
network can reconstruct the file. Indeed, the success of the 
recovery process depends on the budget T. Based on the 
illustration in Figure 1, we use the terms receiver and r-subset 
interchangeably. 

Fig. 1. Combination Network. Virtual source node in layer one has a file of 
size F. The solid nodes in layer two represent the storage nodes in network. 
The third layer contains virtual receiver nodes. Each receiver node corresponds 
to an r-subset of the storage nodes. 

We use these notations throughout the paper: 

- $[m] = \{1, \ldots, m\}$ and $[m]^* = \{0, 1, \ldots, m\}$. 

- $A^r = \{(s_1, \ldots, s_r) : s_i \in A\}$. Note that there is no limit 
on the number of times an element $s_k$ in set A can be 
chosen in $(s_1, \ldots, s_r)$. 

- $A^{[r]} = \{[s_1, \ldots, s_r] : s_i \in A \text{ and } s_i \ne s_j \text{ for } i \ne j\}$. 
In other words, $A^{[r]}$ is the set of ordered vectors with 
distinct elements. 

- $\partial_u^{<F}(\cdot)$ is an operator on polynomials which truncates to 
the terms of degree less than F with respect to u. 

Furthermore, we use the notation $\mathbb{I}$ for the indicator function, 
defined as 

$$\mathbb{I}[\omega \in \Omega] = \begin{cases} 1 & \text{if } \omega \in \Omega, \\ 0 & \text{otherwise.} \end{cases}$$

For an allocation $(x_1, \ldots, x_n)$ of T symbols, let 
$\Psi(x_1, \ldots, x_n)$ count the number of unsuccessful receivers. 
We can write $\Psi$ as 

$$\Psi(x_1, \ldots, x_n) = \sum_{S \subseteq [n],\, |S| = r} \mathbb{I}\Big[\sum_{i \in S} x_i < F\Big],$$

where the first sum is over all the subsets of size r of storage 
nodes. Therefore, the allocation problem we consider is the 
following optimization problem: 

minimize $\Psi(x_1, \ldots, x_n)$ 
subject to $\sum_{i=1}^{n} x_i = T$, 
$x_i \ge 0$, $x_i$ integer. 

It is challenging to find the optimal allocation because of 
the large space of possible allocations, non-convexity, and 
discontinuity of the indicator function. Our approach for 
solving this problem is to look into another quantity which, for 
$r \ll \sqrt{n}$, closely approximates $\Psi$ but is easier to compute. 

B. Main Result 

Let $\alpha_k$ be the fraction of nodes containing k symbols, and 
let c := T/n. The set of constraints on admissible allocations 
with respect to $\alpha$ can be re-written as 

$$\sum_{k=0}^{F} \alpha_k = 1, \qquad \sum_{k=0}^{F} k\,\alpha_k = c, \qquad \alpha_k \ge 0.$$

Given an allocation $(x_1, \ldots, x_n)$, we can compute the parameters 
$\alpha_0, \ldots, \alpha_F$. Then, we define $\varphi(\alpha_0, \ldots, \alpha_F)$ as the probability 
that a receiver with access to a uniformly chosen subset 
of nodes s from $[n]^r$ (shown by $s \sim [n]^r$) is unsuccessful in 
recovering the file. We have 

$$\varphi(\alpha_0, \ldots, \alpha_F) = P_{s \sim [n]^r}\Big[\sum_{i=1}^{r} x_{s_i} < F \,\Big|\, \alpha\Big]. \qquad (1)$$

Our first claim says that $\varphi$ is a good approximation for $\Psi$. 
Theorem 1: 
Proof: The proof is given in Section III. 



Dealing with the functional $\varphi$ is simpler than working with 
$\Psi$. In the definition of $\varphi$, the random vector is chosen from 
$[n]^r$, where repetition is allowed. As a result, the probability 
generating function of $\varphi$ has a simple form and is easy to work 
with. The main consequence of Theorem 1 is that for $r \ll \sqrt{n}$, we 
can solve the problem of minimizing $\varphi(\alpha_0, \ldots, \alpha_F)$ instead, 
which is simpler than solving the original optimization 
problem. Moreover, this solution is also a good approximation 
of the solution to the original problem. We will further discuss 
the discrepancy in the 
optimal solution through an example in the last section. 
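
The difference between sampling nodes with and without repetition can be seen numerically on a small instance. The sketch below is an illustration under assumed parameters (n = 20, r = 2, F = 4, ten nodes holding two symbols each); the function names are hypothetical and the brute-force enumeration is not the method used in the paper.

```python
from itertools import combinations, product
from math import comb

def fraction_failing_subsets(x, r, F):
    """Fraction of r-subsets (no repetition) that store fewer than F symbols."""
    n = len(x)
    failures = sum(
        1 for s in combinations(range(n), r) if sum(x[i] for i in s) < F
    )
    return failures / comb(n, r)

def phi_with_repetition(x, r, F):
    """Failure probability when the r nodes are drawn uniformly with repetition."""
    n = len(x)
    failures = sum(
        1 for s in product(range(n), repeat=r) if sum(x[i] for i in s) < F
    )
    return failures / n ** r

# Assumed toy allocation: 10 of 20 nodes hold 2 symbols each, F = 4, r = 2.
x = [2] * 10 + [0] * 10
exact = fraction_failing_subsets(x, r=2, F=4)
approx = phi_with_repetition(x, r=2, F=4)
print(exact, approx, abs(exact - approx))
```

For this instance the two quantities differ only in the second decimal place, which is consistent with the claim that the approximation is accurate when r is small relative to the square root of n.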

From this point on, we will drop the conditioning on $\alpha$ for 
brevity. Please note that $\varphi$ is just a function of $(\alpha_0, \ldots, \alpha_F)$ 
and its value remains the same for all allocations with the 
same $(\alpha_0, \ldots, \alpha_F)$. 

In order to prove Theorem 1, we first derive the lower bound 
on $\varphi$ and then we prove the upper bound in the lemmas below. 

Lemma 1: For any allocation $(x_1, \ldots, x_n)$ satisfying a 
given set of $\alpha$'s, the following hold: 

Proof of Lemma 1: The inequality (3) follows immediately 
from the definitions of $\varphi$ and $\Psi$ ((1) and (2)). ■ 

In order to prove the upper bound, we need to look at the 
total variation between the distributions of a uniform random 
vector $s \sim [n]^r$ and a uniform random vector $s' \sim [n]^{[r]}$. 

Definition 1: The total variation of two probability distributions 
$\mu$ and $\nu$ on a discrete space $\Omega$ is defined as 

$$TV(\mu, \nu) = \sup_{A \subseteq \Omega} |\mu(A) - \nu(A)|.$$

A well-known integral formula for the total variation between 
two distributions is given by 

$$TV(\mu, \nu) = \frac{1}{2} \sum_{\omega \in \Omega} |\mu(\omega) - \nu(\omega)|.$$
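
As an illustration of the summation formula, the sketch below computes the total variation between the uniform distribution on all length-r vectors over n nodes and the uniform distribution on vectors with distinct entries. The function names and the instance n = 6, r = 3 are assumptions for illustration; the closed form used for comparison, one minus the ratio of the two support sizes, is a quantity that can be checked by hand from the formula.

```python
from itertools import product
from math import perm

def total_variation(mu, nu):
    """Half the L1 distance between two distributions given as dicts."""
    support = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(w, 0.0) - nu.get(w, 0.0)) for w in support)

def uniform_vs_distinct(n, r):
    """TV between uniform on [n]^r and uniform on vectors with distinct entries."""
    omega = list(product(range(n), repeat=r))
    distinct = [w for w in omega if len(set(w)) == r]
    mu = {w: 1.0 / len(omega) for w in omega}
    nu = {w: 1.0 / len(distinct) for w in distinct}
    return total_variation(mu, nu)

n, r = 6, 3
tv = uniform_vs_distinct(n, r)
closed_form = 1 - perm(n, r) / n ** r  # perm(n, r) = n (n-1) ... (n-r+1)
print(tv, closed_form)
```

The enumeration grows as n to the power r, so this check is only feasible for very small n and r.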



Lemma 2: Let $\Omega = [n]^r$. Further, let $\mu$ be the uniform probability 
distribution over $\Omega$, and $\nu$ be the uniform probability 
distribution over the subset S of $\Omega$ consisting of vectors with 
distinct entries; $\nu$ is 0 on $\Omega \setminus S$. Then, we have 



$$TV(\mu, \nu) \le \frac{(r-1)^2}{n}.$$



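Proof of Lemma 2: The total number of non-repetitive 
vectors of size r in $\Omega$ is $n(n-1)\cdots(n-r+1)$. We use the 
shorthand $n^{[r]}$ for this expression. Then we can write 

$$TV(\mu, \nu) = \frac{1}{2}\Big[n^{[r]}\Big(\frac{1}{n^{[r]}} - \frac{1}{n^r}\Big) + \frac{n^r - n^{[r]}}{n^r}\Big] = 1 - \frac{n^{[r]}}{n^r}.$$

III. Discussions and Proof of the Main Result 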

Consider a receiver which has access to the vector of storage 
nodes $s = [s_1, \ldots, s_r]$, where s is uniformly chosen from 
$[n]^{[r]}$. Let $P_{s \sim [n]^{[r]}}\big[\sum_{i=1}^{r} x_{s_i} < F\big]$ represent the probability 
that the total number of symbols stored in a randomly chosen 
set of size r of storage nodes s is less than the file size F. 
There are in total $n(n-1)\cdots(n-(r-1)) = r!\binom{n}{r}$ ordered 
vectors like s in $[n]^{[r]}$ (note that the subsets in $[n]^{[r]}$ are 
ordered). Although working with ordered sets is slightly more 
complicated, as we will see shortly, this will help us in finding 
a better approximation for $\Psi$. 

The total number of unsuccessful receivers $\Psi(x_1, \ldots, x_n)$ 
can be calculated easily if we have the probability $P_{s \sim [n]^{[r]}}$ that 
a receiver with access to a randomly chosen subset of nodes 
s is unsuccessful in recovering the file. In the definition of $\Psi$, 
we are only concerned with the total number of unsuccessful 
receivers. If we choose s from a space like $[n]^{[r]}$, where order 
is important, we need to eliminate the effect of over-counting. 
Here, since $s \sim [n]^{[r]}$, a division by r! is sufficient. Hence, 
the functional $\Psi(x_1, \ldots, x_n)$ can be re-written as 

$$\Psi(x_1, \ldots, x_n) = \frac{n(n-1)\cdots(n-r+1)}{r!}\, P_{s \sim [n]^{[r]}}\Big[\sum_{i=1}^{r} x_{s_i} < F\Big]. \qquad (2)$$

Lemma 3: 

$r!\,\Psi(x_1, \ldots, x_n)$ 

Proof of Lemma 3: Using the results of Lemma 2 and the 
definitions of $\Psi$ and $\varphi$, we can write 

The first inequality above follows from the triangle inequality, 
and the second from Lemma 2. 

■ 

The proof of Theorem 1 follows from Lemmas 1 and 3. 

IV. Optimal Symmetric Allocations 
Following the results of the previous section, for cases 
where $r \ll \sqrt{n}$, we have $2(r-1)^2/n \ll 1$ and therefore, 
finding the optimal allocation of the symbols is equivalent 
to minimizing the function $\varphi(\alpha_0, \ldots, \alpha_F)$. In this section, 
we direct our attention to symmetric allocations. In the case 
of symmetric allocations, we can find the optimal symmetric 
allocation and probability of success for all different budgets 
T. An allocation is called symmetric if we allocate the budget 
T as follows: we pick a number, say j, and we allocate chunks 
of size j to distinct nodes until we run out of the budget. Now, we have two 
types of nodes: the fraction $\alpha_0$ of nodes which are left empty and 
the fraction $\alpha_j$ of the nodes which store j symbols each. 



Again, the optimal allocation is not obvious even if we 
consider only symmetric allocations. For instance, for very low 
budgets ($T \approx F$), we can easily argue that the budget should 
be concentrated over a minimal subset of nodes. For example, 
consider the case where T = F. If we store the entire file on 
one of the storage nodes, then the total number of successful 
receivers is $\binom{n-1}{r-1}$. If we break the file into two parts, each 
of size F/2, then the total number of successful receivers is 
going to be 

$$\binom{n-2}{r-2}.$$

By using the well-known identity 

$$\binom{n-1}{r-1} = \binom{n-2}{r-1} + \binom{n-2}{r-2},$$

it is clear that the former allocation outperforms the latter. 
Similarly, other symmetric allocations can also be rejected. 
When the budget is very high ($T \approx nF/r$), the budget should 
be spread maximally. For example, consider the case where 
T = nF/r. In this case, by spreading the budget over all 
the storage nodes, we can achieve probability-1 recovery. 
If one distributes this budget by allocating chunks of size F 
(storing the file in its entirety), the outcome is worse, since the 
probability of success will be 

$$1 - \binom{n - n/r}{r} \Big/ \binom{n}{r},$$

which is clearly less than 1. This behavior gives rise to 
questions like: "When to switch from minimal to maximal 
spread of the budget?" and "Is there any situation where there 
exists a solution other than minimal or maximal spreading?" 
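
The two extreme cases discussed above can be checked with a few binomial coefficients. This is a small sketch under assumed parameters (n = 10, r = 2); the function names are hypothetical, and the counting arguments are stated in the comments.

```python
from math import comb

def successes_single_copy(n, r):
    # One node stores the whole file (T = F): a receiver succeeds
    # iff its r-subset contains that node.
    return comb(n - 1, r - 1)

def successes_two_halves(n, r):
    # Two nodes store half of the file each (T = F): a receiver succeeds
    # iff its r-subset contains both of them.
    return comb(n - 2, r - 2)

def success_prob_full_copies(n, r, copies):
    # `copies` nodes store the full file: a receiver fails
    # iff its r-subset avoids all of them.
    return 1 - comb(n - copies, r) / comb(n, r)

n, r = 10, 2
print(successes_single_copy(n, r), successes_two_halves(n, r))
# High-budget case T = nF/r: n / r full copies instead of spreading F/r everywhere.
print(success_prob_full_copies(n, r, copies=n // r))
```

In the low-budget case concentration wins by a wide margin, while in the high-budget case storing full copies leaves a noticeable failure probability even though maximal spreading would let every receiver succeed.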

First, we give a useful expression for $\varphi$ in the lemma 
below, which is simpler to work with. Then, we investigate 
the optimal symmetric allocation. 

Lemma 4: 

$$\varphi(\alpha_0, \ldots, \alpha_F) = \partial_u^{<F}\Big(\sum_{k=0}^{F} u^k \alpha_k\Big)^{r}\,\Big|_{u=1}. \qquad (5)$$

Proof of Lemma 4: If $s_i$ is a random element of [n], then 
the probability $P(x_{s_i} = k)$ is equal to $\alpha_k$. Therefore, the 
probability generating function of $x_{s_i}$ is equal to $\sum_{k=0}^{F} u^k \alpha_k$. 
Hence, if $s = (s_1, \ldots, s_r)$ is a uniform random vector in $[n]^r$, 
then the probability generating function of $\sum_{i=1}^{r} x_{s_i}$ is equal 
to $\big(\sum_{k=0}^{F} u^k \alpha_k\big)^r$. It follows then that 

$$P_{s \sim [n]^r}\Big[\sum_{i=1}^{r} x_{s_i} < F\Big] = \partial_u^{<F}\Big(\sum_{k=0}^{F} u^k \alpha_k\Big)^{r}\,\Big|_{u=1},$$

and (5) is immediate. ■ 
In a symmetric allocation, suppose that the fraction of 
the non-empty nodes is $\alpha_j$ with j symbols each. 
Therefore, in the expression of $\varphi(\alpha_0, \ldots, \alpha_F)$ at most $\alpha_0$ and 
$\alpha_j$ have non-zero values. Our goal is to find the optimal value 
of j. 

In this case, using Lemma 4, the problem of minimizing 
$\varphi(\alpha_0, \ldots, \alpha_F)$ over $\{\alpha_0 + \alpha_j = 1,\ j\alpha_j = c\}$ reduces to minimizing 

$$\partial_u^{<F}\big(\alpha_0 + u^{j}\alpha_j\big)^{r}\,\Big|_{u=1}. \qquad (7)$$

Equivalently, by substituting $(1 - \alpha_j)$ for $\alpha_0$, we have 

$$\varphi(\alpha_j) = \sum_{i=0}^{\lfloor (F-1)/j \rfloor} \binom{r}{i} \alpha_j^{i} (1-\alpha_j)^{r-i} \quad \text{for } rj \ge F. \qquad (8)$$

Notice that for rj < F, the maximum degree of u in (7) is 
less than F. Therefore, the operator $\partial_u^{<F}$ does not eliminate 
any term from the expansion and $\varphi(\alpha_j) = 1$. 
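
The generating-function expression of Lemma 4 and its binomial form for symmetric allocations can be compared numerically. The following sketch is an illustration only: polynomials are represented as coefficient lists, the helper names are assumptions, and the instance (F = 4, j = 2, r = 3, half of the nodes non-empty) is chosen so that the result is easy to verify by hand.

```python
from math import comb

def poly_mul_truncated(a, b, max_terms):
    """Multiply coefficient lists a and b, keeping only degrees < max_terms."""
    out = [0.0] * max_terms
    for i, ai in enumerate(a[:max_terms]):
        if ai == 0:
            continue
        for j, bj in enumerate(b[: max_terms - i]):
            out[i + j] += ai * bj
    return out

def phi_from_pgf(alpha, r, F):
    """Sum of the coefficients of degree < F in (sum_k alpha[k] u^k)^r."""
    result = [1.0] + [0.0] * (F - 1)
    for _ in range(r):
        result = poly_mul_truncated(result, alpha, F)
    return sum(result)

def phi_symmetric(alpha_j, j, r, F):
    """Binomial-CDF form for a symmetric allocation with per-node load j."""
    top = min(r, (F - 1) // j)
    return sum(
        comb(r, i) * alpha_j ** i * (1 - alpha_j) ** (r - i)
        for i in range(top + 1)
    )

# Assumed instance: F = 4, nodes hold either 0 or j = 2 symbols, alpha_2 = 0.5, r = 3.
alpha = [0.5, 0.0, 0.5, 0.0, 0.0]  # alpha[k] = fraction of nodes with k symbols
print(phi_from_pgf(alpha, r=3, F=4), phi_symmetric(0.5, j=2, r=3, F=4))
```

Truncating after every multiplication keeps each intermediate polynomial at F coefficients, which is valid because terms of degree F or higher can never contribute to lower degrees.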

Expression (8) has the form of the binomial distribution 
CDF; the following lemma helps us to determine its minima. 



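Lemma 5: The function $\varphi(\alpha_j)$ in (8) is increasing in j 
over every range of j on which $\lfloor (F-1)/j \rfloor$ is constant. Therefore, 
$\varphi(\alpha_j)$ is minimized at some $j = \lceil F/i \rceil$ with $i \in [r]$. 

Proof: For constants m and r, the binomial CDF 
$\sum_{i=0}^{m} \binom{r}{i} p^{i} (1-p)^{r-i}$ is decreasing in p. 
Therefore, if $j_1 < j_2$ and $\lfloor (F-1)/j_1 \rfloor = \lfloor (F-1)/j_2 \rfloor$, then 
$\alpha_{j_1} > \alpha_{j_2}$ and thus, 
$\varphi(\alpha_{j_1}) < \varphi(\alpha_{j_2})$. 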

Lemma 5 reduces the complexity of finding the minimum 
of (8) considerably, as it limits the search for the optimal j, denoted 
by j*, to the set of per-node budgets $\{\lceil F/i \rceil : i \in [r]\}$. 
Therefore, finding the optimal symmetric allocation is reduced 
to computing the probability of successful recovery of the 
original file when an $\alpha_{j^*}$ fraction of the nodes each contain j* symbols 
of the file and the rest of the nodes are empty. 

In order to find the optimal value j*, we derive the probability 
of successful decoding of a random receiver. Suppose 
that only d out of the r storage nodes to which a receiver has 
access are non-empty. In this case, the receiver can recover the 
file only if $d \ge i$. Therefore, the probability of successful file 
recovery when each non-empty storage node has a 1/i portion 
of the file is 

$$\sum_{d=i}^{r} \frac{\binom{\lfloor T/j \rfloor}{d} \binom{n - \lfloor T/j \rfloor}{r-d}}{\binom{n}{r}}, \qquad j = \lceil F/i \rceil, \qquad (9)$$

which has the form of the CDF of the hypergeometric distribution. 
We have to evaluate this function for all $i \in [r]$ and 
choose j* such that the highest success probability is achieved. 
Note that, given r, the solution can be found in constant time 
since by Lemma 5 we just need to evaluate (9) r times. 
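
The search over candidate per-node budgets can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes that the number of non-empty nodes is the integer part of T / j (capped at n), that a receiver is a uniformly chosen r-subset, and the function names and the instance n = 10, r = 2, F = 2 are hypothetical.

```python
from math import comb

def success_probability(n, r, T, F, j):
    """Probability that a uniformly chosen r-subset holds at least F symbols
    when floor(T / j) nodes store j symbols each and the rest are empty."""
    m = min(n, T // j)        # number of non-empty nodes (assumption: capped at n)
    need = (F + j - 1) // j   # ceil(F / j): non-empty nodes a receiver must reach
    hits = sum(comb(m, d) * comb(n - m, r - d) for d in range(need, r + 1))
    return hits / comb(n, r)

def best_symmetric_allocation(n, r, T, F):
    """Evaluate only the candidates ceil(F / i), i = 1..r, and return the best j."""
    candidates = sorted({(F + i - 1) // i for i in range(1, r + 1)})
    return max(candidates, key=lambda j: success_probability(n, r, T, F, j))

# Assumed instance: n = 10, r = 2, F = 2, so the candidates are j = 1 and j = 2.
for T in (4, 8, 9, 10):
    print(T, best_symmetric_allocation(10, 2, T, 2))
```

Because only r candidates are evaluated, the cost of the search does not depend on the budget T, which is the practical benefit of restricting attention to these per-node budgets.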

A. Symmetric Allocation in Connected Random Graphs 

In a practical network, a node cannot connect (via a single 
hop) to every subset of r nodes. As a first step towards practical 
settings, we investigate the asymptotics of the allocation 
problem in large random graphs. A random graph G(n, p) has 
n vertices, and every two vertices are connected with probability 
p. We direct our attention to connected random graphs 
since they better describe real networks. G(n, p) is connected 
with high probability iff p is greater than a critical value $\frac{\log n}{n}$. 
If $p = \frac{d \log n}{n}$ for some constant d, then 
$G\big(n, \frac{d \log n}{n}\big)$ is connected with high 
probability and every vertex has degree $r \asymp \log n$ [7]. 

Suppose that we want to store a file of size F and budget 
T in such a graph, provided that each node can reconstruct 
the file by accessing its 1-hop neighbors. We are interested 
in maximizing the probability that a node is successful as 
the number of nodes n in the network grows. It is clear that 
the budget T should also grow in order to maintain a certain 
success probability for receivers. Otherwise, the probability of 
successful recovery of the file will tend to 0. Given T, the mean 
number of symbols per node is T/n and therefore the mean 
number of symbols a node has access to is equal to rT/n. Since 
the file size is assumed to be constant, the most important 
regime to study is when $rT/n \asymp \mu$, where $\mu$ is a constant. 

In this regime, every one of the random variables 
$x_{s_1}, \ldots, x_{s_r}$, representing the number of symbols in every 
chosen node, is a non-negative random variable with 
expectation $\mu/r$. Standard limit theorems ([8]) imply that 
the random variable $\sum_{i=1}^{r} x_{s_i}$ will follow approximately a 
Poisson distribution. Consider the case $r \asymp d \log n$ and 
$T = \mu n / r$. For $i = 1, \ldots, F$, define $\lambda_i$ so that $\alpha_i = \lambda_i / r$, 
and let $X_1, \ldots, X_F$ be independent Poisson random variables 
such that $X_i$ follows $\mathrm{Poisson}(k; \lambda_i) = \lambda_i^k e^{-\lambda_i} / k!$. Then, classic 
approximation theorems ([9], [10]) imply that the random 
variables $\sum_{i=1}^{r} x_{s_i}$ and $\sum_{j=1}^{F} j X_j$ behave similarly. In fact, 
their difference in total variation obeys the following bound: 

$$TV\Big(\sum_{i=1}^{r} x_{s_i},\ \sum_{j=1}^{F} j X_j\Big) = O\Big(\frac{1}{\log n}\Big).$$
Therefore, it is the case that 

$$P\Big[\sum_{i=1}^{r} x_{s_i} < F\Big] = P\Big[\sum_{j=1}^{F} j X_j < F\Big] + O\Big(\frac{1}{\log n}\Big).$$

In the symmetric case, we allocate either 0 or j symbols. 
Hence, at most $\lambda_0$ and $\lambda_j$ have non-zero values. Since in the 
symmetric case we have $j\alpha_j = \mu/r$, the previous expression 
becomes 

$$P\big[X_j < \lceil F/j \rceil\big] = e^{-\mu/j} \sum_{k=0}^{\lceil F/j \rceil - 1} \frac{(\mu/j)^k}{k!}.$$

Similar to the result in the previous section, since 
$e^{-x} \sum_{k=0}^{m} x^k / k!$ is decreasing in x, in order to find the 
optimal j, we just need to evaluate the above expression for 
$j \in \{\lceil F/i \rceil : i \in [r]\}$ and the optimal value j* is the one 
which maximizes the success probability. 
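
A rough numerical sketch of this Poisson regime is given below. It rests on an assumption made only for illustration: the number of non-empty nodes a receiver reaches is modeled as a Poisson variable with mean mu / j, where mu is the mean number of symbols a node can access and j is the per-node load. The function names and the values F = 4, r = 4 are hypothetical.

```python
from math import exp, factorial

def poisson_cdf(m, lam):
    """P[X <= m] for X ~ Poisson(lam)."""
    return exp(-lam) * sum(lam ** k / factorial(k) for k in range(m + 1))

def approx_success(mu, F, j):
    """Approximate success probability when non-empty nodes store j symbols each.
    Assumption: the count of non-empty nodes reached is Poisson(mu / j)."""
    need = (F + j - 1) // j  # ceil(F / j) non-empty nodes are required
    return 1.0 - poisson_cdf(need - 1, mu / j)

def best_j(mu, F, r):
    """Search only the candidate loads ceil(F / i) for i = 1..r."""
    candidates = sorted({(F + i - 1) // i for i in range(1, r + 1)})
    return max(candidates, key=lambda j: approx_success(mu, F, j))

# Assumed instance: F = 4, r = 4, so the candidate loads are 1, 2 and 4.
for mu in (2, 16):
    print(mu, best_j(mu, F=4, r=4))
```

With a small mean the search favors storing the whole file on few nodes, and with a large mean it favors the smallest per-node load, which mirrors the transition from concentration to spreading described in the text.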

V. Simulation Results and Conclusion 



We numerically investigated the results of Section IV 
through some simulations. Due to the complexity of the 
problem, finding the true optimal allocation for large n is 
not practical. In order to verify our results, we compare the 
approximate solution with the optimal one (found by searching all 
symmetric allocations) for two different examples. First, for 
n = 10 and r = 2, the optimal symmetric allocation consists of 
two parts: for $T/F \in (1, 4.5)$, the file should be stored entirely 
and, for T/F > 4.5, all storage locations should store half of 
the file. As shown in Figure 2, the approximate solution gives 
the correct allocation for this case. For the second case, where 
n = 15 and r = 3, the optimal allocation is more complicated. 
We observe that the choice of j/F = 1 remains optimal until 
T/F = 4.5. Then, for $T/F \in (4.5, 4.65)$, the optimal number 
of nodes to use is 9 ($= \lfloor T/j^* \rfloor$) and each of them stores half 
of the file. Finally, we observe a transition that spreads the 
file maximally over all storage nodes. It is interesting that in 
this case our approximate solution again matches the optimal 
symmetric allocation. Figure 3 plots the probability of success 

Fig. 3. Probability of success vs normalized budget for n = 15 and r = 3 
in the symmetric allocation where each node stores j/F fraction of the file. 

In general, we observe a transition from concentration of 
budget over a minimal number of nodes to maximal spreading 
of the budget over all storage nodes as the budget increases 
(this observation is also reported in [6]). This transition is not 
sharp, as we observed that there are cases where the number 
of non-empty nodes is neither of the extremes. Also, where 
the transition happens is not trivial to determine, and for each 
budget the optimal allocation should be computed using the 
machinery developed in this paper. Finding useful algorithms 
for the optimal allocation in the general sense and 
also for more realistic scenarios remains of interest. 